skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Editors contains: "Shao, Mingfu"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Shao, Mingfu (Ed.)
    Graphs are powerful tools for modeling and analyzing molecular interaction networks. Graphs typically represent either undirected physical interactions or directed regulatory relationships, which can obscure a particular protein’s functional context. Graphlets can describe local topologies and patterns within graphs, and combining physical and regulatory interactions offer new graphlet configurations that can provide biological insights. We present GRPhIN, a tool for characterizing graphlets and protein roles within graphlets in mixed physical and regulatory interaction networks. We describe the graphlets of mixed networks in B. subtilis, C. elegans, D. melanogaster, D. rerio, and S. cerevisiae and examine local topologies of proteins and subnetworks related to the oxidative stress response pathway. We found a number of graphlets that were abundant in all species, specific node positions (orbits) within graphlets that were over-represented in stress-associated proteins, and rarely-occurring graphlets that were over-represented in oxidative stress subnetworks. These results showcase the potential for using graphlets in mixed physical and regulatory interaction networks to identify new patterns beyond a single interaction type. 
    more » « less
    Free, publicly-accessible full text available July 21, 2026
  2. Shao, Mingfu (Ed.)
    Identifying novel and functional RNA structures remains a significant challenge in RNA motif design and is crucial for developing RNA-based therapeutics. Here we introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% (121 dual graphs) correspond to approximately 200,000 known RNA atomic fragments/substructures (collected in 2021) using the RNA-as-Graphs (RAG) framework. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters. The cluster with the higher percentage of known dual graphs for RNA is defined as the “RNA-like cluster, while the other is considered as “non-RNA-like. The distance between each dual graph and the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. From validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Among the top 15 graphs identified as high-likelihood candidates for novel RNA motifs, 4 were confirmed from the RNA dataset collected in 2022. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs (subgraphs preserve pseudoknots and junctions). Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features (e.g., Betti-0 and Betti-1 numbers). These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, perhaps anti-viral therapeutics by subgraph assembly. 
    more » « less
    Free, publicly-accessible full text available July 15, 2026
  3. Shao, Mingfu (Ed.)
    Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca . 
    more » « less